The molecular kinetics community often builds discrete-state models of continuous-state time-series, called Markov State Models (MSMs).

How do we compute the likelihood of a time-series given an MSM?

Let's say we want to compute the likelihood of a vector-valued sequence given an MSM. The MSM is defined by (1) a partitioning of the entire continuous space into $n$ non-overlapping chunks, and (2) an $n \times n$ transition matrix $T$, where each entry $T_{ij}$ is the probability that you'll end up anywhere in chunk $j$ if you start anywhere in chunk $i$ and wait one lag time.

The likelihood of a discrete sequence $S = (s_1, s_2, \dots, s_N)$ is easy to compute:

$$p(S) = p(s_1) \prod_{t=2}^N p(s_t | s_{t-1}) = p(s_1) \prod_{t=2}^N T_{s_{t-1} s_t}$$

This is what's currently done in MSMBuilder when you ask for a likelihood, for example: https://github.com/msmbuilder/msmbuilder/blob/master/msmbuilder/msm/msm.py#L318
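
For concreteness, here's a minimal sketch of that computation (a hypothetical helper, not MSMBuilder's implementation; `initial_dist[i]` is assumed to hold $p(s_1 = i)$):

import numpy as np

def discrete_log_likelihood(S, initial_dist, T):
    # log p(S) = log p(s_1) + sum over t of log T[s_{t-1}, s_t]
    S = np.asarray(S)
    return np.log(initial_dist[S[0]]) + np.log(T[S[:-1], S[1:]]).sum()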

But in the case of MSMs, this obviously doesn't tell the whole story. You could, for instance, choose a totally useless state-space partitioning that maps every possible vector to the same discrete state ($n = 1$, $T_{11} = 1.0$), but evaluating this likelihood function would tell you that the probability of any observation sequence is $1.0$.

What we really want is the likelihood of a vector-valued sequence, $X = ( \vec{x}_1, \vec{x}_2, \dots, \vec{x}_N)$:

$$p(X) = p(\vec{x}_1) \prod_{t = 2}^N p(\vec{x}_t | \vec{x}_{t-1})$$

Evaluating this from an MSM requires one more step, because the probability density at any individual point in a partition is not the same as the probability of the whole partition; rather, the density integrated over the partition must equal that probability. Consequently, the transition probability $p(\vec{x}_t | \vec{x}_{t-1})$ isn't simply equal to the corresponding element of $T$. $T_{ij}$ is actually an integral:

$$T_{ij} = \int_{s_j} p(\vec{y} | \vec{x})d\vec{y}$$

where the "origin point" is anywhere in the partition $i$, $\vec{x} \in s_i$, and we integrate over all "destination points" in the partition $j$, $\vec{y} \in s_j$.

Since the transition density implied by an MSM is uniform within each destination partition, this integral is easy. All we need is the volume $V_j$ of each partition:

$$\begin{aligned}T_{ij} & = V_j p(\vec{y} | \vec{x})\\ p(\vec{y} | \vec{x}) & = T_{ij}/{V_j}\end{aligned}$$

for any $\vec{x} \in s_i$, $\vec{y} \in s_j$.

So now we can compute the vector-to-vector transition probability from quantities that the MSM defines: the volume of each partition, and the transition matrix.
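
In code, that's one extra division (a sketch; `assign` is a hypothetical function mapping a vector to its partition index, and `volumes[j]` holds $V_j$):

def transition_density(x_prev, x_curr, assign, T, volumes):
    # p(x_t | x_{t-1}) = T[i, j] / V_j, for x_{t-1} in partition i and x_t in partition j
    i, j = assign(x_prev), assign(x_curr)
    return T[i, j] / volumes[j]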


In [1]:
# basic imports
import numpy as np
import numpy.random as npr
npr.seed(0)
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# construct a 2-state Markov model, and sample a long trajectory

# the 2 states A and B are Gaussian blobs in 2D
A = (0,0)
B = (5,5)
centers = dict()
centers[0] = A
centers[1] = B

# the transition matrix is symmetric and metastable
trans_mat = np.array([[0.7,0.3],
                      [0.3,0.7]])

# sample a trajectory
traj_len=100000
traj=npr.randn(traj_len,2)
state = 0
for i in range(1,traj_len):
    state = int(npr.rand() > trans_mat[state][0]) # only works for 2-state MSM
    traj[i] += centers[state]

X = traj[:int(traj_len*0.9)]
X_test = traj[int(traj_len*0.9):]
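
(Aside: the state-update line above is hard-coded for 2 states; the general $n$-state version would be something like `state = npr.choice(len(trans_mat), p=trans_mat[state])`.)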

In [3]:
plt.imshow(trans_mat,interpolation='none',cmap='Blues');
plt.xticks([0,1])#,[r'$\mathcal{N}((0,0),1)$',r'$\mathcal{N}((5,5),1)$'])
plt.yticks([0,1])#,[r'$\mathcal{N}((0,0),1)$',r'$\mathcal{N}((5,5),1)$'])
plt.colorbar()
plt.title('True transition matrix')



In [4]:
# Use a library function for computing a state decomposition.
# In general we don't know the correct number of states beforehand, so let's over-estimate here:
from msmbuilder.cluster import MiniBatchKMedoids
kmed = MiniBatchKMedoids(n_clusters=8)
kmed.fit([X])


Out[4]:
MiniBatchKMedoids(batch_size=100, max_iter=5, max_no_improvement=10,
         metric='euclidean', n_clusters=8, random_state=None)

In [5]:
# assign each point in the trajectory to a partition
clusters = kmed.transform([X])[0]

In [6]:
clusters


Out[6]:
array([7, 0, 0, ..., 4, 0, 5])

In [7]:
plt.scatter(X[:,0],X[:,1],c=clusters, linewidths=0)
plt.title('All points in trajectory, colored by cluster')



In [8]:
# Monte Carlo volume estimation: sample points uniformly over the box [-5, 10]^2
uniform = npr.rand(100000,2)*15-5
uniform_clusters = kmed.transform([uniform])[0]

In [9]:
plt.scatter(uniform[:,0],uniform[:,1],c=uniform_clusters, linewidths=0)
plt.xlim(-5,10)
plt.ylim(-5,10)
plt.colorbar()



In [10]:
plt.hist(uniform_clusters,bins=len(set(uniform_clusters)),density=True);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional volume')



In [11]:
from msmbuilder.msm import MarkovStateModel

In [12]:
msm = MarkovStateModel()
msm.fit(clusters)


MSM contains 1 strongly connected component above weight=1.00. Component 0 selected, with population 100.000000%
Out[12]:
MarkovStateModel(ergodic_cutoff=1.0, lag_time=1, n_timescales=10,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)


In [14]:
msm.score_ll([clusters])


Out[14]:
-162738.90950436893

In [15]:
msm.score_ll(kmed.transform([X_test]))


Out[15]:
-17961.369947497973

In [16]:
T = msm.transmat_

In [17]:
# T is row-normalized
T[0],np.sum(T[0])


Out[17]:
(array([ 0.393207  ,  0.09053321,  0.04219531,  0.05193111,  0.30181058,
         0.06873631,  0.02912721,  0.02245926]), 1.0000000000000002)

In [18]:
# not column-normalized
np.sum(T[:,0])


Out[18]:
1.7771448893957471

In [19]:
# let's compute the volume of each partition

def compute_cluster_volumes(total_volume,uniform_cluster_assignments):
    counts_per_cluster = np.zeros(uniform_cluster_assignments.max()+1)
    for x in uniform_cluster_assignments:
        counts_per_cluster[x] += 1.0
    fraction_in_cluster = counts_per_cluster / len(uniform_cluster_assignments)
    volumes = fraction_in_cluster * total_volume
    return volumes

# volume of the uniform sampling box (side length ~15, dimension 2)
total_volume = (uniform.max() - uniform.min())**(uniform.shape[1])
volumes = compute_cluster_volumes(total_volume,uniform_clusters)
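
A quick sanity check on this Monte Carlo estimate: the per-cluster volumes must tile the whole sampling box, so `np.allclose(volumes.sum(), total_volume)` should come back `True` (assuming every uniform sample was assigned to some cluster).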

In [20]:
T[0],volumes


Out[20]:
(array([ 0.393207  ,  0.09053321,  0.04219531,  0.05193111,  0.30181058,
         0.06873631,  0.02912721,  0.02245926]),
 array([ 54.26056109,  22.65506675,  30.24650534,   1.28473961,
         39.95067684,  26.96153191,  22.33556933,  27.30352915]))

In [21]:
# k[j] = 1/V_j converts a partition probability into a (uniform) density over that partition;
# k*T[0] gives the transition densities out of state 0
k = 1.0 / volumes
k*T[0]


Out[21]:
array([ 0.00724664,  0.00399616,  0.00139505,  0.04042151,  0.00755458,
        0.00254942,  0.00130407,  0.00082258])
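
(Multiplying these densities back by the partition volumes recovers the transition probabilities: `np.allclose(k*T[0]*volumes, T[0])` holds by construction.)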

In [22]:
plt.imshow(T + T.T,interpolation='none',cmap='Blues');
plt.title('Symmetrized transition matrix ($T + T^T$)')



In [23]:
# inverse volumes k = 1/V (uniform densities) and the volumes themselves
plt.bar(range(len(k)),k)
plt.ylabel('1 / volume')
plt.figure()
plt.bar(range(len(k)),volumes)
plt.ylabel('Volume')



In [24]:
# now let's score a trajectory

In [25]:
# prior probability *density* vector $\pi$
pi = msm.populations_*k
pi


Out[25]:
array([ 0.00510055,  0.00657795,  0.00241941,  0.0667501 ,  0.00538753,
        0.00424396,  0.00219583,  0.00133969])
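
(Since `msm.populations_` sums to 1, this prior density integrates to 1 over the whole space: `np.allclose((pi*volumes).sum(), 1.0)` holds by construction.)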

In [26]:
msm.populations_


Out[26]:
array([ 0.27675852,  0.14902397,  0.07317867,  0.08575649,  0.21523553,
        0.11442358,  0.04904503,  0.03657821])

In [27]:
# log probabilities
log_pi = np.log(pi)
log_normalized_T = np.log(k*T)

In [28]:
def score_correctly_ll(sequence,log_pi,log_normalized_T):
    log_prob = log_pi[sequence[0]]
    for i in range(1,len(sequence)):
        log_prob += log_normalized_T[sequence[i-1],sequence[i]]
    return log_prob
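
An equivalent vectorized form, for reference (a sketch using NumPy fancy indexing):

def score_correctly_ll_vec(sequence, log_pi, log_normalized_T):
    # same computation as score_correctly_ll, without the Python loop
    s = np.asarray(sequence, dtype=int)
    return log_pi[s[0]] + log_normalized_T[s[:-1], s[1:]].sum()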

In [29]:
score_correctly_ll(clusters,log_pi,log_normalized_T)


Out[29]:
-458418.19890994485

In [30]:
score_correctly_ll(kmed.transform([X_test])[0],log_pi,log_normalized_T)


Out[30]:
-50980.176168027727

In [31]:
# sanity check: what if we have a 1-state MSM?

# in MSMBuilder, we would first perform a clustering, then pass the sequence of cluster labels to be scored:
one_state_clusters = [np.zeros(len(clusters))]
msm = MarkovStateModel()
msm.fit(one_state_clusters)
print('MSMBuilder log likelihood: {0}'.format(msm.score_ll(one_state_clusters)))
# this leads to a probability of 1.0 for any observation sequence, even though the model contains no information

# for the corrected log_likelihood, we account for the volume of the cluster
log_pi = np.log(1.0/total_volume)
print('Corrected log likelihood: {0:.3f}'.format(log_pi*len(clusters)))
# that's thankfully a bit lower than the multi-state MSM log-likelihood-- sanity check passed!


MSM contains 1 strongly connected component above weight=1.00. Component 0 selected, with population 100.000000%
MSMBuilder log likelihood: 0.0
Corrected log likelihood: -487448.308
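
(Arithmetic check: each of the 90,000 frames contributes $\log(1/\text{total volume}) \approx -\log 225 \approx -5.42$ nats, and $90{,}000 \times -5.42 \approx -4.87 \times 10^5$, consistent with the printed value.)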

In [32]:
MarkovStateModel().fit([np.zeros(1000)]).score_ll([np.zeros(1000)])


MSM contains 1 strongly connected component above weight=1.00. Component 0 selected, with population 100.000000%
Out[32]:
0.0

In [33]:
from msmbuilder.example_datasets import FsPeptide

In [34]:
fs = FsPeptide().get().trajectories


loading trajectory_1.xtc ... loading trajectory_28.xtc (28 trajectories)

In [35]:
from msmbuilder.featurizer import DihedralFeaturizer

In [36]:
dih = DihedralFeaturizer(['phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4']).fit_transform(fs)

In [37]:
dih[0].shape


Out[37]:
(10000, 148)

In [38]:
np.vstack(dih).max(),np.vstack(dih).min()


Out[38]:
(1.0, -1.0)

In [39]:
kmed_fs = MiniBatchKMedoids(n_clusters=10)
kmed_fs.fit(dih)


Out[39]:
MiniBatchKMedoids(batch_size=100, max_iter=5, max_no_improvement=10,
         metric='euclidean', n_clusters=10, random_state=None)

In [40]:
clusters_fs = kmed_fs.transform(dih)

In [41]:
# simple monte carlo: uniformly throw darts in a 148-dimensional cube
uniform_fs = np.array(npr.rand(1000000,dih[0].shape[1]),dtype='float32')*2 - 1
uniform_clusters_fs = kmed_fs.transform([uniform_fs])[0]
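
(A caveat on the dart-throwing: the relative error of a Monte Carlo volume estimate scales like $1/\sqrt{m}$, where $m$ is the number of darts landing in a cluster, so in 148 dimensions these fractional volumes should be treated as rough estimates.)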

In [42]:
frac_volume = np.array([np.sum(uniform_clusters_fs==i) for i in range(10)],dtype=float) / len(uniform_clusters_fs)

In [43]:
plt.bar(range(10),frac_volume);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional volume')


Out[43]:
<matplotlib.text.Text at 0x10dccfcd0>

In [44]:
occupancy = np.array([np.sum(np.vstack(clusters_fs)==i) for i in range(10)],dtype=float)
frac_occupancy = occupancy / np.sum(occupancy)
plt.bar(range(10),frac_occupancy);
plt.xlabel('Cluster ID')
plt.ylabel('Fractional occupancy')


Out[44]:
<matplotlib.text.Text at 0x112e275d0>

In [45]:
point_density = frac_occupancy / frac_volume
plt.bar(range(10),point_density / np.min(point_density));
plt.xlabel('Cluster ID')
plt.ylabel('Relative point density')


Out[45]:
<matplotlib.text.Text at 0x10d0f37d0>

In [46]:
# in this case, bigger clusters are not necessarily more populous: some clusters
# are up to ~40x denser in observed points than others

In [47]:
uniform_clusters_fs


Out[47]:
array([4, 3, 4, ..., 4, 8, 3])

In [48]:
# total volume: yikes! that's large!
total_volume_fs = 2.0**148
total_volume_fs


Out[48]:
3.5681192317649e+44
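
(In log space this is more manageable: `np.log(total_volume_fs)` is $148 \ln 2 \approx 102.6$, i.e. about 102.6 nats of per-frame density correction.)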

In [49]:
# Now we can score this MSM the same way as above...

volumes_fs = compute_cluster_volumes(total_volume_fs,uniform_clusters_fs)

In [50]:
volumes_fs


Out[50]:
array([  1.61996181e+43,   2.22532892e+43,   1.97491831e+43,
         4.67616298e+43,   4.76258283e+43,   3.48787223e+43,
         5.74734805e+43,   2.71405421e+43,   5.20731321e+43,
         3.26564976e+43])

In [51]:
msm_fs = MarkovStateModel(prior_counts=1)
msm_fs.fit(clusters_fs)


MSM contains 1 strongly connected component above weight=1.00. Component 0 selected, with population 100.000000%
Out[51]:
MarkovStateModel(ergodic_cutoff=1.0, lag_time=1, n_timescales=10,
         prior_counts=1, reversible_type='mle', sliding_window=True,
         verbose=True)

In [52]:
def compute_normalized_log_probs_from_msm(msm,volumes):
    k = 1.0/volumes
    pi = msm.populations_*k
    log_pi = np.log(pi)
    T = msm.transmat_
    log_normalized_T = np.log(k*T)
    return log_pi,log_normalized_T

In [53]:
log_pi_fs,log_normalized_T_fs = compute_normalized_log_probs_from_msm(msm_fs,volumes_fs)

In [54]:
lls = []
for traj in clusters_fs:
    lls.append(score_correctly_ll(traj,log_pi_fs,log_normalized_T_fs))

In [55]:
np.sum(lls)


Out[55]:
-28286089.012853153

Another proposed likelihood function:

$$\mathcal{L}(D \mid T, S) = \left[\prod_{n=1}^{N} T_{s_{n-1} s_n}\right] \left[\prod_{n=1}^{N} \frac{e^{-U(x_n)}}{\pi_{s_n}}\right] \left[\prod_{n=1}^{N} \pi_{s_n}\right]$$

